Skip to content

fix(databricks_zerobus sink): defer Unity Catalog schema fetch out of build()#25408

Merged
petere-datadog merged 9 commits into
vectordotdev:masterfrom
flaviofcruz:zerobus-defer-descriptor
May 27, 2026
Merged

fix(databricks_zerobus sink): defer Unity Catalog schema fetch out of build()#25408
petere-datadog merged 9 commits into
vectordotdev:masterfrom
flaviofcruz:zerobus-defer-descriptor

Conversation

@flaviofcruz

@flaviofcruz flaviofcruz commented May 11, 2026

Copy link
Copy Markdown
Contributor

Summary

databricks_zerobus's SinkConfig::build() synchronously calls Unity Catalog to fetch the table's protobuf descriptor. If the table doesn't exist or credentials are wrong, the call fails inside build() and Vector exits before the sink even starts — even when healthcheck.enabled = false.

This aligns the sink with the convention used by AWS S3, Kafka, etc.: build() does only local setup, the healthcheck does the remote probe, and runtime failures surface per-batch via the existing retry/event-status path.

Changes:

  • UC descriptor + encoder + stream-mode are resolved lazily via tokio::sync::OnceCell on first use (and re-attempted on each failure).
  • Event encoding moves from ZerobusSink into ZerobusService::encode_batch, since the encoder now depends on the lazily-resolved descriptor.
  • ZerobusService::new only builds the SDK client and the HTTP client (both local).
  • ensure_stream (the healthcheck) is the natural gate: it triggers schema resolution + stream creation, and runs the same way on first ingest if the healthcheck is disabled.
  • Replaced the eager-ProxyConfig-stash with an HttpClient built once in new, so fetch_table_schema takes &HttpClient and doesn't reconstruct it per retry.
  • Added logic to decide whether schema fetch errors should be treated as retryable or not. If UC is transiently failing, we will try again during ingestion. Potentially this opens the door to dynamically re-fetch the schema as time goes on.

Vector configuration

  sinks:
    zb:
      type: databricks_zerobus
      inputs: [demo]
      ingestion_endpoint: "https://ingest.dev.databricks.com"
      unity_catalog_endpoint: "https://workspace.cloud.databricks.com"
      table_name: "main.default.does_not_exist"
      auth:
        strategy: oauth
        client_id: "${DATABRICKS_CLIENT_ID}"
        client_secret: "${DATABRICKS_CLIENT_SECRET}"
      healthcheck:
        enabled: false

Before: Vector exits at startup with Unity Catalog API returned error 404: ....
After: Vector starts; batches are rejected at ingest time with the same error logged per batch.

How did you test this PR?

  • Manual smoke tests still to do:
  • Non-existent table with default settings → sink starts, events rejected.
  • Bad OAuth credentials → sink starts, events rejected.
  • healthcheck.enabled = false with unreachable UC → sink starts.
  • require_healthy = true → Vector exits (opt-in fail-fast preserved).

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.

@github-actions github-actions Bot added the domain: sinks Anything related to the Vector's sinks label May 11, 2026
@flaviofcruz flaviofcruz force-pushed the zerobus-defer-descriptor branch from 30bffb2 to 83ebd94 Compare May 11, 2026 16:45
@flaviofcruz flaviofcruz marked this pull request as ready for review May 11, 2026 19:46
@flaviofcruz flaviofcruz requested a review from a team as a code owner May 11, 2026 19:46
@github-actions github-actions Bot added the domain: ci Anything related to Vector's CI environment label May 12, 2026
Comment thread .github/workflows/semantic.yml

@petere-datadog petere-datadog left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested, works as expected, code looks good to me

@pront pront left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codex review

also a few checks are failing

Comment thread changelog.d/25408_databricks_zerobus_lazy_schema.fix.md Outdated
@flaviofcruz flaviofcruz force-pushed the zerobus-defer-descriptor branch from 3e6c015 to bdd9e09 Compare May 20, 2026 20:26
@flaviofcruz flaviofcruz requested a review from pront May 20, 2026 21:08
@pront pront added the no-changelog Changes in this PR do not need user-facing explanations in the release changelog label May 22, 2026
@pront pront enabled auto-merge May 22, 2026 16:38
@pront pront added this pull request to the merge queue May 22, 2026
@pront pront removed this pull request from the merge queue due to a manual request May 22, 2026
@pront

pront commented May 22, 2026

Copy link
Copy Markdown
Member

please add databricks_zerobus sink to semantic.yml

@flaviofcruz

Copy link
Copy Markdown
Contributor Author

please add databricks_zerobus sink to semantic.yml

It is added here: https://github.com/flaviofcruz/vector/blob/bdd9e0910af6bd3fd2c2459222f402d823b99858/.github/workflows/semantic.yml But I can merge a separate PR if that makes it easier.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f2cea4a861

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/sinks/databricks_zerobus/service.rs
Comment thread src/sinks/databricks_zerobus/unity_catalog_schema.rs Outdated
@flaviofcruz flaviofcruz force-pushed the zerobus-defer-descriptor branch from f2cea4a to 9b65cc6 Compare May 26, 2026 21:47
When the Tower retry budget is exhausted on a retryable failure (UC 5xx/408/429,
network blip, transient SDK error), the driver previously mapped the resulting
Err to EventStatus::Rejected — a permanent drop, indistinguishable from a UC
404 or an EncodingError. Ack-enabled sources and disk buffers had no way to
know a replay would have helped.

This change wires a small Tower layer outside the existing retry stack:
- Ok responses pass through unchanged.
- Err that downcasts to a retryable ZerobusSinkError (or doesn't downcast at
  all — conservative) becomes Ok(ZerobusResponse::errored()), so the driver
  sets Errored and the source / disk buffer may replay.
- Err that is permanent (EncodingError, SchemaError { retryable: false },
  MissingAckOffset, non-retryable SDK errors) still propagates so the driver
  maps it to Rejected.

ZerobusResponse now carries an EventStatus field with delivered() / errored()
constructors; ingest() returns ZerobusResponse::delivered(...).

Also brings unity_catalog_schema::status_is_retryable into line with Vector's
canonical RetryStrategy::Default by delegating to it. The visible effect: UC
501 (Not Implemented) is now classified as permanent (was retryable under the
prior blanket 5xx rule).

Co-authored-by: Isaac
@flaviofcruz flaviofcruz force-pushed the zerobus-defer-descriptor branch from 9b65cc6 to cd7fe9f Compare May 26, 2026 21:49
@petere-datadog petere-datadog enabled auto-merge May 27, 2026 15:45
@petere-datadog petere-datadog added this pull request to the merge queue May 27, 2026
Merged via the queue into vectordotdev:master with commit ee870f9 May 27, 2026
58 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators May 27, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

domain: ci Anything related to Vector's CI environment domain: sinks Anything related to the Vector's sinks no-changelog Changes in this PR do not need user-facing explanations in the release changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants